Author: Firesledge (aka Cretindesalpes)
Version: 1.0.1
Download: http://ldesoras.free.fr/prod.html#src_avstp
Category: Multi-threading support, development tools
AVSTP is a programming library for Avisynth plug-in developers. It helps supporting native multi-threading in plug-ins. It works by sharing a thread pool between multiple plug-ins, so the number of threads stays low whatever the number of instantiated plug-ins. This helps saving resources, especially when working in an Avisynth MT environment. This documentation is mostly targeted to plug-ins developpers, but contains installation instructions for Avisynth users too.
From the Avisynth user point of view, an AVSTP-enabled plug-in requires the avstp.dll file to be installed.
Put it in the usual AviSynth 2.5\plugins\ directory, or load it manually with LoadPlugin("path\avstp.dll") in your script.
The dll is shared between all plug-ins using AVSTP, so keep only one avstp.dll file in your plug-in set.
If you're updating from a previous version, make sure that Avisynth will get access only to the latest one.
If a plug-in requiring AVSTP cannot find the dll file, it could crash, emit an error, or fall back gracefully on a single-threaded mode, depending on its design and implementation. There is no mandatory or pre-defined behaviour for such a case.
The number of threads is automatically set to the number of available logical processors. The thread count can also be controlled via an Avisynth function, so multi-threading can be disabled globally if not desired.
avstp_set_threads (var c, int n)
This function sets the global number of AVSTP threads running in the Avisynth process.
The call is optional.
When n is set to 1, AVSTP is monothreaded.
When n is set to 0, AVSTP uses the number of detected logical processors, which is the default when the function is not called.
To work smootly, calling the function only once is prefered, at the end of the main script, just before returning the final clip.
Behaviour is unspecified if it is called multiple times, or from the Avisynth runtime system (ScriptClip, ConditionalFilter, etc.)
The c parameter is ignored and is defined only to make the function work as desired when applied on a clip.
AVSTP provides plug-ins with a thread pool, an object that can perform tasks asynchroniously. This means you just need to split the main task into several smaller tasks able to work in parallel, send them all to the thread pool and wait for their completion. The thread pool will handle the job queue and dispatch the tasks to the available worker threads.
The src directory contains the following items:
avstp: source code to build avstp.dllavstp_example: a simple Avisynth plugin (average of two clips) showing how to use the AVSTP helpers.common: miscellaneous C++ helper classes and wrappers on the top of the main AVSTP API.If you use the main API, add avstp.lib to your project, or load the dll manually on startup to resolve the avstp_* symbols.
avstp.h should be the only included file you'll need for the basic operations.
If your programming language is C++, you can also use the AvstpWrapper singleton class to take care of finding the library, loading it and resolve the symbols.
It will use a single-threaded mode if the library cannot be found.
This method is recommended to access the API.
MTSlicer is a simple class for easily splitting a frame into horizontal bands, each processed by a thread.
This method is suited to a majority of plug-ins.
See the next section for more information.
Another class, MTFlowGraphSched, can handle a simple graph of dependencies between tasks.
Also, the conc namespace offers a set of tools for dealing with data in a concurrent environment while minimizing the locks.
More specifically, lock-free pools of objects are useful to handle temporary recyclable storage for the threaded tasks.
See ObjPool and ObjFactoryInterface.
A typical use of the low-level API looks like this:
avstp_create_dispatcher() at the beginning of a frame to process, or do it once in the filter constructoravstp_destroy_dispatcher() when done (or in the destructor).Each threaded operation looks like this:
avstp_enqueue_task() on all these tasksavstp_wait_completion() to wait for synchronisation.A task is made of two entities:
avstp_enqueue_task() returns.
Actually this is the case when AVSTP works in single-threaded mode._mm_empty() instruction.
Indeed there is not guarantee on the state of the FP/MMX registers,
since the thread could have been previously used by other tasks using the MMX instruction set without restoring the registers.Most filters can work by computing independently single pixels or small blocks of the destination frame.
These filters can be multithreaded by splitting the frame into as many horizontal bands as there are threads.
This is similar to the MT() function, but much more flexible because the processing functions can access the whole input frames, not just a window corresponding to the rendered band.
Filters that require first a global evaluation (for example computing the average luma of the input) don't fall directly into this category, but may be easily transformed to fit the concept.
For example, you can use a conc::AtomicInt<int64_t> to store the global luma sum, which will be updated at the end of each band processing.
Then use a second pass of band splitting to do the actual process using the calculated average.
For more complex data patitioning and merging, you can use the map/reduce concept.
Here is a step-by-step guide to build a multi-threaded filter with the MTSlicer class.
We'll assume you're alreay familiar with single-threaded Avisynth filter programming.
The following code is taken from the avstp_example project, which is a plug-in averaging the two input clips.
First, the include files.
There are the usual Windows.h and avisynth.h, more the AVSTP-related files.
Nothing fancy here, excepted the NOMINMAX definition, to put before the Windows header inclusion.
Indeed, Windows min and max macro definitions interfere with the standard std::min and std::max template functions contained in <algorithm>.
The latter functions are used in the library.
#define NOMINMAX #define WIN32_LEAN_AND_MEAN #include "Windows.h" #include "avisynth.h" #include "avstp.h" #include "MTSlicer.h"
In the filter class definition, we'll have to declare a structure containing data that all the tasks may need.
We'll put here mainly the current frame structure information.
You can put any data related to the current frame (input and output).
However _this_ptr is a mandatory member and must be declared as a pointer to your filter class.
class TaskDataGlobal
{
public:
AvstpAverage * _this_ptr;
::BYTE * _dst_ptr;
const ::BYTE * _sr1_ptr;
const ::BYTE * _sr2_ptr;
int _stride_dst;
int _stride_sr1;
int _stride_sr2;
int _w;
};
Separating the global frame data from the main filter class is MT-mode-1-safe.
If compatibility with MT mode 2 is enough for your needs, just put everything as members of your Avisynth filter class.
In this case, you can omit _this_ptr.
Then, we typedef the slicer, for easy later use.
The first template parameter is your filter class (actually the same as pointed by the _this_ptr).
The second one the data structure common to all slices.
typedef MTSlicer <AvstpAverage, TaskDataGlobal> Slicer;
Averaging frames can be on a plane basis: all planes are independant from each other.
Instead of starting N threads for a plane, waiting for their completion and starting again for the other planes, we can send all the tasks simultaneously and finally wait for the completion only once.
However, this requires multiple common structures, as the data for each plane are different.
A second consequence is that it also requires several slicers, because plane heights may be different (luma vs chroma).
So we bundle a TaskDataGlobal with a Slicer for easier access and storage.
class PlaneProc
{
public:
TaskDataGlobal _tdg;
Slicer _slicer;
};
Then we declare our main processing function. This is a simple member function that is fed with a specific structure containing the top and bottom lines to process as well as a pointer to the global structure. The function code is be described below.
void process_subplane (Slicer::TaskData &td);
The GetFrame() function.
First, we loop on planes to fill the structure and call start() on the slicer.
The height parameter is used by the slicer to generates the line numbers for each slice.
Generally, you may want to set it to the current plane height in pixels, because dividing the frame horizontally has a greater cache efficiency.
However, it really can be any unit you want.
For example, if you're processing the frame by blocks, it's better to count block rows instead of pixel rows, because you're sure that the frame will be sliced correctly.
If the blocks are huge in size (low number of blocks), you can count an absolute number of blocks to be sure that the frame will be equally sliced.
The second parameter is the global data to attach, and the third one the processing function. Note the funky way to get a pointer on a member function.
Don't forget to initialise _this_ptr, too.
In the second part, we loop on the slicers to wait for their respective task completion. The loop exits only when all the tasks are finished.
::PVideoFrame __stdcall AvstpAverage::GetFrame (int n, ::IScriptEnvironment *env_ptr)
{
::PVideoFrame fsr1_sptr = _src1_sptr->GetFrame (n, env_ptr);
::PVideoFrame fsr2_sptr = _src2_sptr->GetFrame (n, env_ptr);
::PVideoFrame fdst_sptr = env_ptr->NewVideoFrame (vi);
// Sets one slicer per plane, because planes may have different
// heights (YV12). We could have done it differently and handle
// the chroma subsampling from the plane processing function,
// but it would have been less trivial to deal with.
PlaneProc pproc_arr [3];
for (int p = 0; p < _nbr_planes; ++p)
{
static const int plane_id_arr [3] = { PLANAR_Y, PLANAR_U, PLANAR_V };
const int plane_id = plane_id_arr [p];
PlaneProc & pproc = pproc_arr [p];
// Collects the source and destination plane information
pproc._tdg._this_ptr = this;
pproc._tdg._dst_ptr = fdst_sptr->GetWritePtr (plane_id);
pproc._tdg._sr1_ptr = fsr1_sptr->GetReadPtr (plane_id);
pproc._tdg._sr2_ptr = fsr2_sptr->GetReadPtr (plane_id);
pproc._tdg._stride_dst = fdst_sptr->GetPitch (plane_id);
pproc._tdg._stride_sr1 = fsr1_sptr->GetPitch (plane_id);
pproc._tdg._stride_sr2 = fsr2_sptr->GetPitch (plane_id);
pproc._tdg._w = (vi.IsPlanar ())
? fdst_sptr->GetRowSize (plane_id)
: vi.RowSize ();
const int height = fdst_sptr->GetHeight (plane_id);
// Setup and run the tasks. We need to provide the slicer with
// the plane height so it can generate coherent top and bottom
// row indexes for each slice.
pproc._slicer.start (height, pproc._tdg, &AvstpAverage::process_subplane);
}
// Waits for all the tasks to terminate
for (int p = 0; p < _nbr_planes; ++p)
{
pproc_arr [p]._slicer.wait ();
}
return (fdst_sptr);
}
Finally, the processing function.
The td structure contains data specific to the slice.
It's private to the slice, not shared.
td._y_beg is the first row to process.
Important: the interval is half-open, so td._y_end is not the last row, it's the row after the last one.
Therefore td._y_end - td._y_beg gives the number of lines to process.
void AvstpAverage::process_subplane (Slicer::TaskData &td)
{
assert (&td != 0);
const TaskDataGlobal & tdg = *(td._glob_data_ptr);
// Sets our pointers to the slice top location
::BYTE * dst_ptr = tdg._dst_ptr + td._y_beg * tdg._stride_dst;
const ::BYTE * sr1_ptr = tdg._sr1_ptr + td._y_beg * tdg._stride_sr1;
const ::BYTE * sr2_ptr = tdg._sr2_ptr + td._y_beg * tdg._stride_sr2;
// Slice processing
for (int y = td._y_beg; y < td._y_end; ++y)
{
for (int x = 0; x < tdg._w; ++x)
{
dst_ptr [x] = ::BYTE ((sr1_ptr [x] + sr2_ptr [x] + 1) >> 1);
}
dst_ptr += tdg._stride_dst;
sr1_ptr += tdg._stride_sr1;
sr2_ptr += tdg._stride_sr2;
}
}
td._slicer_ptr, which is ignored here, can be used to access the dispatcher via the slicer object.
Thus, additionnal tasks may be enqueued, they will be attributed to the same group as the current task (and waited by the same call).
That's about everything you need to know to build a robust and efficient Avisynth multi-threaded plug-in.
There are tasks that are complex and cannot be decomposed in multiple parallel sub-tasks from the beginning to the end. Some calculations may have dependencies on other calculations, showing a lot of vertices on the processing flowgraph. Here are a few general tips:
avstp_wait_completion() only for the final node, or for global synchronisation points.conc::AtomicInt template class or just the InterlockedIncrement/InterlockedDecrement system functions.conc::ObjPool and implement conc::ObjFactoryInterface to share them efficiently between threads.All the following functions are declared in avstp.h.
They are all thread-safe, so they can be called concurrently from any thread.
If the call fails, a null pointer is returned, or a negative error code if the return value is an int.
int avstp_get_interface_version ();
Returns the interface version.
See the avstp_INTERFACE_VERSION value in the header.
avstp_TaskDispatcher * avstp_create_dispatcher ();
Call this function once to create a dispatcher, before processing anything.
The call must be paired with a call to avstp_destroy_dispatcher() to destroy the dispatcher.
void avstp_destroy_dispatcher (avstp_TaskDispatcher *td_ptr);
When you're done with multi-threading, call avstp_destroy_dispatcher() to release the dispatcher.
Not doing this may result in resource leaks.
Destroy the dispatcher only when all the associated tasks are known to be finished.
td_ptr is the dispatcher to destroy.
int avstp_get_nbr_threads ();
Returns the current number of threads.
The value may vary during the filter construction stage, depending on an avstp_set_threads() script function call.
int avstp_enqueue_task (avstp_TaskDispatcher *td_ptr,
avstp_TaskPtr task_ptr,
void *user_data_ptr);
Schedules the asynchronious execution of a task. td_ptr is the dispatcher which the task will be attached to. task_ptr and user_data_ptr are respectively a pointer to the function to execute and a pointer to its private data, given as parameter. You can enqueue new tasks at any moment, even if the main threads has started waiting.
int avstp_wait_completion (avstp_TaskDispatcher *td_ptr);
When called, the function will wait for the completion of all the tasks attached to the dispatcher. If you do not want to wait for some tasks, create a second dispatcher for them. The function will return immediately if no task have been previously scheduled.
v1.0.1, 2012.05.27
v1.0.0, 2012.03.11